A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer
Abstract
Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable, interoperable and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the ISO standard for lexical resource representation, the Lexical Markup Framework (LMF). This lexical database is used in developing an open-source finite-state morphological processing toolkit. We build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
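To make the interoperability claim concrete, here is a minimal sketch of what an LMF-style entry could look like and how any tool might read it back. The element names (`LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `feat` with `att`/`val`) follow the general LMF XML convention. The feature names chosen here (`gender`, `number`), the single sample entry, and the helper functions are illustrative assumptions, not the actual AraComLex schema.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> element to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def make_entry(lexicon, lemma, pos, **features):
    """Add one LexicalEntry; the feature names are hypothetical, not AraComLex's."""
    entry = ET.SubElement(lexicon, "LexicalEntry")
    feat(entry, "partOfSpeech", pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)
    for att, val in sorted(features.items()):
        feat(entry, att, val)
    return entry


def read_entries(root):
    """Read every LexicalEntry back into a plain dict, independent of the writer."""
    rows = []
    for entry in root.iter("LexicalEntry"):
        # Only direct <feat> children describe the entry itself.
        attrs = {f.get("att"): f.get("val") for f in entry.findall("feat")}
        attrs["lemma"] = entry.find("Lemma/feat[@att='writtenForm']").get("val")
        rows.append(attrs)
    return rows


# Build a one-entry resource (sample data assumed for illustration).
resource = ET.Element("LexicalResource")
lexicon = ET.SubElement(resource, "Lexicon")
feat(lexicon, "language", "ara")
make_entry(lexicon, "كتاب", "noun", gender="masculine", number="singular")

# Serialize, then parse again as a separate consumer would.
xml_text = ET.tostring(resource, encoding="unicode")
entries = read_entries(ET.fromstring(xml_text))
```

Because the entry is plain XML, a finite-state compiler, a web front end, or a statistical tool could each consume the same file without sharing code, which is the sense in which such a database is independent of design approach and programming language.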
Similar resources
An Open-Source Finite State Morphological Transducer for Modern Standard Arabic
We develop an open-source large-scale finite-state morphological processing toolkit (AraComLex) for Modern Standard Arabic (MSA) distributed under the GPLv3 license. The morphological transducer is based on a lexical database specifically constructed for this purpose. In contrast to previous resources, the database is tuned to MSA, eliminating lexical entries no longer attested in contemporary ...
An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modelling Finite State Networks
Morphological ambiguity is a major concern for syntactic parsers, POS taggers and other NLP tools. For example, the greater the number of morphological analyses given for a lexical entry, the longer a parser takes in analyzing a sentence, and the greater the number of parses it produces. Xerox Arabic Finite State Morphology and Buckwalter Arabic Morphological Analyzer are two of the best known,...
Handling Unknown Words in Arabic FST Morphology
A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...
Lexical Profiling for Arabic
We provide lexical profiling for Arabic by covering two important linguistic aspects of Arabic lexical information, namely morphological inflectional paradigms and syntactic subcategorization frames, making our database a rich repository of Arabic lexicographic details. First, we provide a complete description of the inflectional behaviour of Arabic lemmas based on statistical distribution. We ...
Klex: A Finite-State Transducer Lexicon of Korean
This paper describes the implementation and system details of Klex, a finite-state transducer lexicon for the Korean language, developed using XRCE’s Xerox Finite State Tool (XFST). Klex is essentially a transducer network representing the lexicon of the Korean language with the lexical string on the upper side and the inflected surface string on the lower side. Two major applications for Klex ...